library(tidyverse)
library(knitr)
library(wooldridge)
library(dplyr)
library(haven)
library(kableExtra)
library(DT)
library(gtsummary)
library(latex2exp)
library(broom)
library(stargazer)
library(car)
library(Hmisc)
library(ggeffects)
library(lmtest)
library(sandwich)
library(fixest)

Problem Set 10
Exercises Overview
Exercise 1.
Reduced form
Wooldridge Exercise 15.6 (p. 524)
(i)
We have the reduced form from 15.26:
\[
y_2=\pi_0+\pi_1z_1+\pi_2z_2+v_2
\] We also have the structural equation 15.22: \[
y_1=\beta_0+\beta_1y_2+\beta_2z_1+u_1
\] Inserting \(y_2\) into the expression for \(y_1\) yields the following:
\[
\Longrightarrow y_1=\beta_0+\beta_1(\pi_0+\pi_1z_1+\pi_2z_2+v_2)+\beta_2z_1+u_1
\] \[
=\beta_0+\beta_1\pi_0+\beta_1\pi_1z_1+\beta_1\pi_2z_2+\beta_1v_2+\beta_2z_1+u_1
\] \[
=\beta_0+\beta_1\pi_0+\beta_1\pi_1z_1+\beta_1\pi_2z_2+\beta_2z_1+\beta_1v_2+u_1
\] \[
=(\beta_0+\beta_1\pi_0)+(\beta_1\pi_1+\beta_2)z_1+\beta_1\pi_2z_2+(\beta_1v_2+u_1)
\] Collecting the terms in parentheses, we can read off:
\[
\alpha_0=\beta_0+\beta_1\pi_0
\] \[
\alpha_1=\beta_1\pi_1+\beta_2
\] \[
\alpha_2=\beta_1\pi_2
\]
(ii)
Here we are asked to find the reduced-form error, \(v_1\), in terms of \(u_1\), \(v_2\), and the parameters:
\[
v_1=\beta_1v_2+u_1
\]
(iii)
We are asked how we would consistently estimate \(\alpha_j\).
We know the following hold by assumption:
\[ Cov(z_1,u_1)=0 \] \[ Cov(z_2,u_1)=0 \] \[ Cov(z_1,v_2)=0 \] \[ Cov(z_2,v_2)=0 \] Thus, we have that:
\[ Cov(z_1,v_1)=Cov(z_1,\beta_1v_2+u_1)=\beta_1\cdot Cov(z_1,v_2)+Cov(z_1,u_1)=0 \] And also:
\[
Cov(z_2,v_1)=0
\] This implies that we can use OLS to obtain consistent estimates of \(\alpha_j\), since MLR.1 through MLR.4 hold.
MLR.1 through MLR.4 (with simple explanations):
- MLR.1 (Linear in parameters): The model must be linear in its coefficients (the unknowns we want to estimate). This ensures that OLS makes sense mathematically.
- MLR.2 (Random sampling): We assume that the data is a random sample from the population. This means that our sample is representative and not biased.
- MLR.3 (No perfect collinearity): The independent variables must not be perfectly correlated. We must have variation in the explanatory variables, so they are not redundant.
- MLR.4 (Zero conditional mean): The error term must have an expected value of zero given any value of the independent variables. Formally: \(E(u|X) = 0\). This is the key assumption that makes OLS estimates unbiased and consistent.
What these imply for the solution:
Because the covariances like \(Cov(z_1, u_1), Cov(z_1, v_2)\), etc., are zero, MLR.4 holds:
- There is no correlation between the instruments (the \(z\)'s) and the error terms.
- Therefore, the explanatory variables are "exogenous": they are not contaminated by the error.
Combined with MLR.1–MLR.3, this tells us that using OLS will give consistent estimates of \(\alpha_j\).
Intuition behind the solution:
Even though \(v_1\) and \(v_2\) might be related to each other in complex ways, the instruments \(z_1\) and \(z_2\) are carefully chosen so they are uncorrelated with all the error terms. Thus, when we regress using these variables, the error part does not “sneak into” our estimates — it averages out nicely.
In short:
No correlation with the errors \(\Rightarrow\) no bias in estimation \(\Rightarrow\) OLS is consistent.
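As a quick sanity check, a small simulation (with made-up parameter values, not taken from the exercise) can illustrate that OLS of \(y_1\) on \(z_1\) and \(z_2\) recovers the \(\alpha_j\) derived above:

```r
# Hypothetical parameter values, chosen for illustration only
set.seed(42)
n  <- 100000
b0 <- 1;   b1 <- 0.5; b2 <- 2        # structural parameters
p0 <- 0.3; p1 <- 1.2; p2 <- -0.7     # reduced-form parameters for y2

z1 <- rnorm(n); z2 <- rnorm(n)       # exogenous variables
v2 <- rnorm(n); u1 <- rnorm(n)       # errors, uncorrelated with z1 and z2

y2 <- p0 + p1 * z1 + p2 * z2 + v2    # reduced form (15.26)
y1 <- b0 + b1 * y2 + b2 * z1 + u1    # structural equation (15.22)

# OLS of y1 on z1 and z2 should give approximately:
# alpha0 = b0 + b1*p0 = 1.15, alpha1 = b1*p1 + b2 = 2.60, alpha2 = b1*p2 = -0.35
round(coef(lm(y1 ~ z1 + z2)), 2)
```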
Exercise 2.
IV assumptions
Wooldridge Exercise 15.8 (p. 525)
(i)
It would be reasonable to control for sociodemographic background, former grades, parental income (as a proxy for sociodemographic position), and cognitive ability. Controlling for the quality of previous education, rather than only its length, could also be of interest.
(ii)
\[
\text{score}=\beta_0+\beta_1\cdot (\text{grades}) \ + \ \beta_2\cdot (\text{parental income}) \ + \ \beta_3\cdot (\text{ability}) \ + \ \beta_4 \cdot (\text{IQ}) \ + \ \beta_5\cdot(\text{girlhs}) \ + \varepsilon
\]
(iii)
We can assume a positive correlation between the two. If parental motivation influences both a student's academic performance and the choice to attend a girls' high school, this must be taken into account in the model; parents' influence in motivating their children to make certain choices is an important factor.
(iv)
For an instrumental variable for girlhs to be valid, it must be exogenous (not endogenous) and relevant. It is relevant if there is a correlation between the number of girls' high schools nearby and the probability of attending a girls' high school. To be exogenous, it must not affect students' scores through any channel other than attendance at a girls' high school.
Exogeneity seems likely to hold here, since it is hard to see how the number of girls' high schools could affect students' scores except by changing the probability of attending one.
Put differently:
The instrument must satisfy two conditions: relevance and exogeneity.
It is relevant if there is a correlation between the number of girls’ high schools within 20 miles and the likelihood that a girl attends a girls’ high school.
It is exogenous if the number of nearby girls’ high schools affects test scores only through its influence on school attendance, and not in any other direct way.
Exogeneity appears plausible in this case, since it is unlikely that the number of girls’ high schools in the area would affect a girl’s test scores except by increasing the chance that she attends one.
(v)
With respect to the relevance condition, we strictly only require some correlation between the two variables for it to hold. Thus, observing a statistically significant covariance between them is, to some extent, a positive sign. However, we must also be able to explain why the sign of the coefficient comes out as it does. In this context we observe that an increase in the number of girls' high schools in the area is associated with a decrease in the probability of attending one, which is nonsensical. If we cannot explain why the sign is positive or negative, the theory behind the instrument breaks down, and it may be that \(E[u|z]=0\) does not hold, i.e., the exogeneity condition is not satisfied in its entirety. This highlights the importance of the sign: we must be able to explain it for the instrument to be credible. If the exogeneity assumption does in fact hold, we must instead attempt to generate a new theory that fits the results.
Exercise 3.
Time series model with measurement error
Wooldridge Exercise 15.11 (p. 525)
(i)
Inserting yields the following:
\[ \Longrightarrow y_t=\beta_0+\beta_1(x_t-e_t)+u_t=\beta_0+\beta_1x_t-\beta_1e_t+u_t \] We are now able to find the covariance as follows.
\[ Cov(x_t,v_t)=Cov(x_t \ , \ -\beta_1e_t+u_t) \] \[ =-\beta_1\cdot Cov(x_t \ , \ e_t)+Cov(x_t \ , \ u_t) \] \[ =-\beta_1\cdot Cov(x_t^\ast+e_t \ , \ e_t)+Cov(x_t^\ast +e_t \ , \ u_t) \] \[ =-\beta_1\cdot (Cov(x_t^\ast \ , \ e_t) + V(e_t))+Cov(x_t^\ast \ , \ u_t) + Cov(e_t \ , \ u_t) \] \[ =-\beta_1\cdot (0 + V(e_t))+0+0 \] \[ =-\beta_1 \cdot V(e_t) \] We have that \(V(e_t) \geq 0\). Thus, the covariance is negative if \(\beta_1>0\) (and \(V(e_t)>0\)), and positive if \(\beta_1<0\).
The estimator is biased because TS.3 does not hold: a correlation between \(x_t\) and the error term persists. The source of the problem here is measurement error in the regressor (the classical errors-in-variables case), not an omitted variable.
Formally:
We are given the true model:
\[y_t = \beta_0 + \beta_1 x_t^* + u_t\]
and observe
\[x_t = x_t^* + e_t\]
We can substitute for \(x_t^*\):
\[x_t^* = x_t - e_t\]
Substituting into the true model:
\[y_t = \beta_0 + \beta_1 (x_t - e_t) + u_t = \beta_0 + \beta_1 x_t - \beta_1 e_t + u_t\]
Define the new error term as:
\[v_t = -\beta_1 e_t + u_t\]
Thus, the model becomes:
\[y_t = \beta_0 + \beta_1 x_t + v_t\]
For the OLS estimator of \(\beta_1\) to be unbiased, we require that:
\[Cov(x_t, v_t) = 0\]
We now compute \(Cov(x_t, v_t)\):
\[Cov(x_t, v_t) = Cov(x_t, -\beta_1 e_t + u_t)\] \[= -\beta_1 \, Cov(x_t, e_t) + Cov(x_t, u_t)\]
Since \(x_t = x_t^* + e_t\), we have:
\[Cov(x_t, e_t) = Cov(x_t^*, e_t) + Var(e_t)\]
By assumption, \(x_t^*\) and \(e_t\) are uncorrelated, so \(Cov(x_t^*, e_t) = 0\). Therefore:
\[Cov(x_t, e_t) = Var(e_t)\]
Additionally, \(u_t\) is assumed uncorrelated with both \(x_t^\ast\) and \(e_t\), implying \(Cov(x_t, u_t) = 0\).
Substituting back:
\[Cov(x_t, v_t) = -\beta_1 \, Var(e_t)\]
Because \(Var(e_t) \geq 0\), the sign of the covariance depends on the sign of \(\beta_1\).
If \(\beta_1 > 0\), then \(Cov(x_t, v_t) < 0\).
If \(\beta_1 < 0\), then \(Cov(x_t, v_t) > 0\).
Since \(Cov(x_t, v_t) \neq 0\), the OLS assumption TS.3 (zero correlation between the regressor and the error term) is violated. Consequently, the OLS estimator of \(\beta_1\) is biased.
This bias is known as attenuation bias, meaning that if \(\beta_1 > 0\), the OLS estimator will underestimate the true \(\beta_1\), pulling the estimate toward zero.
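The attenuation can be illustrated with a short simulation (all parameter values are assumptions for illustration; here \(Var(x^\ast)=Var(e)=1\), so the attenuation factor \(Var(x^\ast)/(Var(x^\ast)+Var(e))\) equals \(0.5\)):

```r
set.seed(1)
n      <- 100000
x_star <- rnorm(n)                    # true regressor, Var(x*) = 1
e      <- rnorm(n)                    # classical measurement error, Var(e) = 1
x      <- x_star + e                  # observed, mismeasured regressor
y      <- 2 + 1 * x_star + rnorm(n)   # true beta1 = 1

# plim of the OLS slope is beta1 * Var(x*) / (Var(x*) + Var(e)) = 0.5
coef(lm(y ~ x))[["x"]]
```

The estimated slope comes out near 0.5 rather than the true value of 1, i.e., pulled toward zero exactly as the derivation predicts.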
(ii)
Here we strive to show that the covariance between the two variables equates with zero as follows.
\[ Cov(x_{t-1},v_t)=Cov(x_{t-1},-\beta_1e_t+u_t) \] \[ =-\beta_1\cdot Cov(x_{t-1},e_t)+Cov(x_{t-1},u_t) \] \[ =-\beta_1\cdot Cov(x_{t-1}^\ast+e_{t-1},e_t)+Cov(x_{t-1}^\ast+e_{t-1},u_t) \] \[ =-\beta_1\cdot Cov(x_{t-1}^\ast,e_t)-\beta_1\cdot Cov(e_{t-1},e_t)+Cov(x_{t-1}^\ast,u_t)+Cov(e_{t-1},u_t) \] \[ =-0-0+0+0 \] \[ =0 \]
Through the alternative covariance formula we can put it as follows.
\[
Cov(x_{t-1},v_t)=E[x_{t-1}\cdot v_t]-E[x_{t-1}]\cdot E[v_t]
\] Since \(E[v_t]=E[-\beta_1e_t+u_t]=-\beta_1\cdot E[e_t]+E[u_t]=0\), the second term vanishes:
\[
Cov(x_{t-1},v_t)=E[x_{t-1}\cdot v_t]
\] Combining this with the result above, \(Cov(x_{t-1},v_t)=0\), gives
\[
E[x_{t-1}\cdot v_t]=0
\]
(iii)
Different shocks might have long-lasting effects on the following periods. Thus, the two variables \(x_t\) and \(x_{t-1}\) are quite possibly correlated. This can be stated mathematically as follows.
\[ Cov(x_t,x_{t-1})=Cov(x_t^\ast+e_t,x_{t-1}^\ast+e_{t-1}) \] \[ =Cov(x_t^\ast,x_{t-1}^\ast)+Cov(e_t,x_{t-1}^\ast)+Cov(x_t^\ast,e_{t-1})+Cov(e_t,e_{t-1}) \] \[ =Cov(x_t^\ast,x_{t-1}^\ast)+0+Cov(x_t^\ast,e_{t-1})+0 \] \[ =Cov(x_t^\ast,x_{t-1}^\ast)+Cov(x_t^\ast,e_{t-1}) \] Here \(e_t\) is only assumed uncorrelated with current and previous values of \(x^\ast\) (and with \(e_{t-1}\)); nothing is assumed about the correlation between \(e_{t-1}\) and the later value \(x_t^\ast\), which is why that term remains. Shocks rarely appear in one period and vanish the next; their effects usually persist into subsequent periods, so \(x_t^\ast\) is serially correlated and \(Cov(x_t^\ast,x_{t-1}^\ast)\neq 0\). Hence \(x_t\) and \(x_{t-1}\) are very likely correlated, which is exactly the relevance condition we need.
(iv)
We have shown that \(x_t\) is most likely an endogenous variable in the equation. Thus, we must use an instrumental variable (IV) to fix the problem of endogeneity. In (ii) and (iii) we showed that \(x_{t-1}\) satisfies the exogeneity and relevance requirements, implying that it is well suited to be used as an instrumental variable.
Instrumental variables are used when an explanatory variable is correlated with the error term, causing bias in ordinary least squares (OLS) estimation. An instrumental variable is a third variable that is correlated with the problematic explanatory variable but uncorrelated with the error term. It allows consistent estimation of the causal effect even when standard OLS assumptions fail.
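A sketch of the fix on simulated data (all parameter values are assumptions for illustration): with a serially correlated true regressor measured with classical error, OLS is attenuated, while instrumenting \(x_t\) with \(x_{t-1}\) via fixest restores a consistent estimate.

```r
library(fixest)

set.seed(2)
n      <- 20000
x_star <- as.numeric(arima.sim(list(ar = 0.8), n))  # serially correlated true x*
e      <- rnorm(n)                                  # classical measurement error
x      <- x_star + e                                # observed regressor
y      <- 1 + 0.7 * x_star + rnorm(n)               # true beta1 = 0.7

d <- data.frame(y = y[-1], x = x[-1], x_lag = x[-n])
ols <- feols(y ~ x, data = d)              # attenuated toward zero
iv  <- feols(y ~ 1 | x ~ x_lag, data = d)  # 2SLS with x_{t-1} as instrument
c(ols = coef(ols)[["x"]], iv = coef(iv)[["fit_x"]])
```

The IV estimate lands near the true 0.7, while the OLS estimate is visibly shrunk toward zero.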
Exercise 4.
IV estimation
Wooldridge Exercise 15.C1 (iv)-(vi) (p. 526)
The first three questions, (i)–(iii), were answered in Problem Set 9.
Flashback to ProblemSet9
(i)
load("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/ProblemSets/ProblemSet9/Problem set 9/wage2.RData")
model <- lm(lwage ~ sibs, data = data)
tidy_model <- tidy(model)
kbl(tidy_model, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6.8611 | 0.0221 | 310.7714 | 0 |
| sibs | -0.0279 | 0.0059 | -4.7230 | 0 |
We observe that the coefficient on sibs is different from the one where sibs is used as an instrumental variable for educ.
This regression estimates the direct effect of the number of siblings (sibs) on log wages (lwage). It does not account for endogeneity or omitted variable bias. For instance, families with more children might differ systematically in unobserved ways that also influence wages, such as socioeconomic status or parental time.
In contrast, when sibs is used as an IV for educ, we isolate the variation in education that is driven by the number of siblings, and then estimate how that variation in education affects wages. Thus, the interpretation and estimation approach differ.
(ii)
The hypothesis is that birth order (brthord) affects education (educ), since earlier-born children may receive more parental investment. This idea is particularly relevant in contexts like the US.
model2 <- lm(educ ~ brthord, data = data)
tidy_model2 <- tidy(model2)
kbl(tidy_model2, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 14.1494 | 0.1287 | 109.9623 | 0 |
| brthord | -0.2826 | 0.0463 | -6.1062 | 0 |
We observe a statistically significant negative coefficient, suggesting that higher birth order (i.e., being later-born) is associated with fewer years of education.
This supports the relevance condition for IV estimation: brthord does predict educ. However, for brthord to be a valid instrument, it must also satisfy the exogeneity condition — it should not be correlated with unobserved factors that directly affect wages (i.e., the error term in the wage equation).
(iii)
iv_model <- feols(lwage ~ 1 | educ ~ brthord, data = data)
NOTE: 83 observations removed because of NA values (IV: 0/83).
tidy_iv_model <- tidy(iv_model)
kbl(tidy_iv_model, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.0304 | 0.4329 | 11.6189 | 0 |
| fit_educ | 0.1306 | 0.0320 | 4.0777 | 0 |
We estimate the effect of education on log wages using birth order as an instrument. The result shows that one more year of education increases wages by approximately 13% on average.
However, this estimate is not a general effect across the population. It is a local average treatment effect (LATE) — it applies specifically to the subset of individuals whose educational attainment is influenced by their birth order.
In other words:
This estimate reflects how wage outcomes are affected by variation in education that is induced by differences in birth order. It does not tell us the average effect of education for everyone — only for those whose education decisions are affected by the instrument.
(iv)
We can test if the relevance assumption is satisfied by using the reduced form for education as follows.
\[ educ=\pi_0+\pi_1\cdot \text{sibs}+\pi_2\cdot\text{brthord}+v \] The relevance assumption states that \(\pi_2\neq0\). This is the so called first stage equation. Using R yields the following result.
load("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/ProblemSets/ProblemSet9/Problem set 9/wage2.RData")
model <- feols(educ ~ sibs + brthord, data = data)
NOTE: 83 observations removed because of NA values (RHS: 83).
tidy_model <- tidy(model)
kbl(tidy_model, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

The regression becomes as follows.

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 14.2965 | 0.1333 | 107.2601 | 0.0000 |
| sibs | -0.1529 | 0.0399 | -3.8341 | 0.0001 |
| brthord | -0.1527 | 0.0571 | -2.6749 | 0.0076 |
Observing the coefficients we conclude that the current model gives \(\hat{\pi}_2=-0.1527\) (the coefficient on brthord), which is statistically significant with \(\text{p-value}\approx0.008\). Thus, the relevance condition holds.
The regression output presents estimates for a model with two explanatory variables: \(sibs\) and \(brthord\).
The estimated equation is:
\[\widehat{\text{educ}} = 14.2965 - 0.1529 \, sibs - 0.1527 \, brthord\]
The interpretation of the coefficients is as follows:
The intercept is \(14.2965\), meaning that when both \(sibs\) and \(brthord\) are zero, the predicted value of the dependent variable is \(14.2965\).
The coefficient on \(sibs\) is \(-0.1529\). This means that, holding \(brthord\) constant, an additional sibling is associated with a decrease of approximately \(0.1529\) units in the dependent variable. The \(p\)-value is \(0.0001\), which is highly statistically significant at conventional levels (such as \(1\%\) and \(5\%\)).
The coefficient on \(brthord\) is \(-0.1527\). This suggests that, holding \(sibs\) constant, an increase in birth order by one position (e.g., moving from second-born to third-born) is associated with a decrease of approximately \(0.1527\) units in the dependent variable. The \(p\)-value is \(0.0076\), indicating statistical significance at the \(1\%\) level.
Since both explanatory variables have negative and statistically significant coefficients, the model suggests that larger family size and later birth order are associated with lower values of the dependent variable - education.
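One extra diagnostic worth noting (an addition, not part of the original answer): with a single excluded instrument, the first-stage F statistic for instrument relevance equals the squared t statistic on that instrument, and a common rule of thumb flags F below 10 as a weak instrument.

```r
# t statistic on brthord from the first-stage table above
t_brthord <- -2.6749
t_brthord^2  # about 7.2 -- below the rule-of-thumb threshold of 10
```

An F below 10 suggests brthord is on the weak side as an instrument here, which is consistent with the relatively imprecise 2SLS estimates that follow.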
(v)
The regression is made in R.
The following estimates a 2SLS (Two-Stage Least Squares) regression using the fixest package in R.

model <- feols(lwage ~ sibs | educ ~ brthord, data = data)
NOTE: 83 observations removed because of NA values (IV: 0/83).
tidy_model <- tidy(model)
kbl(tidy_model, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

Using 2SLS yields a model with the following coefficients.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.9385 | 1.0557 | 4.6780 | 0.0000 |
| fit_educ | 0.1370 | 0.0747 | 1.8344 | 0.0669 |
| sibs | 0.0021 | 0.0174 | 0.1215 | 0.9033 |
The regression output presents the two-stage least squares (2SLS) estimates after instrumenting \(educ\) with \(brthord\) and treating \(sibs\) as its own instrument.
The key results are:
The coefficient on \(fit\_educ\) is \(0.1370\), suggesting that an additional year of education is associated with an increase of \(0.1370\) units in the dependent variable, holding \(sibs\) constant. The \(p\)-value is \(0.0669\), indicating marginal statistical significance at the \(10\%\) level but not at the \(5\%\) level.
The coefficient on \(sibs\) is \(0.0021\), with a \(p\)-value of \(0.9033\), indicating no statistical significance. The effect of the number of siblings on the dependent variable is extremely small and not different from zero.
Regarding the standard errors:
The standard error for \(fit\_educ\) is relatively large compared to its coefficient (\(0.0747\) versus \(0.1370\)), explaining why the \(t\)-statistic is modest and the \(p\)-value is only marginally significant.
The standard error for \(sibs\) is also large relative to its very small coefficient, leading to a very small \(t\)-statistic and high \(p\)-value.
In conclusion, using \(brthord\) as an IV for \(educ\) increases the standard error for \(\hat{\beta}_{educ}\) compared to ordinary least squares (OLS), reflecting the loss of precision that typically occurs when using instrumental variables. The coefficient on \(sibs\) remains statistically insignificant.
(vi)
We run a predicting model in R.
# Run the first-stage regression
first_stage <- feols(educ ~ sibs + brthord, data = data)
NOTE: 83 observations removed because of NA values (RHS: 83).
# Create a cleaned dataset that only contains non-missing observations
data_clean <- data[!is.na(data$educ) & !is.na(data$sibs) & !is.na(data$brthord), ]
# Predict fitted values
educ_predicted <- predict(first_stage)
# Now calculate the correlation
cor(educ_predicted, data_clean$sibs)

The correlation is as follows.

[1] -0.9294818
We see that the correlation is very close to \(-1\). Such high correlation becomes a problem when we estimate the regression on lwage: the near-collinearity between the two explanatory variables causes multicollinearity, which distorts the results and likely renders the coefficients insignificant.
The intuition behind this exercise is that higher standard errors are caused by the explanatory variables being highly correlated. It then becomes difficult to decide how much of the variation in the outcome is attributable to each explanatory variable, and this uncertainty raises the standard errors of both coefficients.
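The standard-error inflation can be illustrated with a small simulation (all values assumed for illustration): the same model is estimated once with highly correlated regressors and once with independent ones.

```r
set.seed(3)
n <- 5000
z <- rnorm(n)

# Highly correlated regressors (they share the common component z)
x1 <- z + 0.4 * rnorm(n)
x2 <- z + 0.4 * rnorm(n)
y  <- 1 + x1 + x2 + rnorm(n)
se_corr <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

# Independent regressors with the same error variance
x1u <- rnorm(n); x2u <- rnorm(n)
yu  <- 1 + x1u + x2u + rnorm(n)
se_indep <- summary(lm(yu ~ x1u + x2u))$coefficients["x1u", "Std. Error"]

c(correlated = se_corr, independent = se_indep)  # se_corr is noticeably larger
```

The standard error in the correlated design comes out well above the one in the independent design, even though the sample size and error variance are the same.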
Exercise 5.
Estimation of reduced form and structural model
Wooldridge Exercise 15.C2 (p. 526)
(i)
The regression is carried out in R as follows.
data <- read_dta("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/data/fertil2.dta")
data$age_squared <- (data$age)^2
model <- feols(children ~ educ + age + age_squared, data = data)
tidy_model <- tidy(model)
kbl(tidy_model, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

The regression yields the estimated model as follows.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -4.1383 | 0.2406 | -17.2004 | 0 |
| educ | -0.0906 | 0.0059 | -15.2981 | 0 |
| age | 0.3324 | 0.0165 | 20.0882 | 0 |
| age_squared | -0.0026 | 0.0003 | -9.6511 | 0 |
In this model we expect, ceteris paribus (all else equal, holding age fixed), women to have 0.0906 fewer children on average for each additional year of education.
If 100 women each received another year of education, the expected decrease in the total number of children would be \(0.0906\cdot100=9.06\), i.e., about 9 children.
(ii)
The assumption ensures that the exogeneity condition is satisfied, so we now need to verify that the relevance condition is also met. To check this, we can run the reduced form regression of education on age and frsthalf and see if the instrument has a significant effect.
model_2 <- feols(educ ~ frsthalf + age + age_squared, data = data)

The new regression yields the following results.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 9.6929 | 0.5981 | 16.2069 | 0.0000 |
| frsthalf | -0.8523 | 0.1128 | -7.5537 | 0.0000 |
| age | -0.1080 | 0.0420 | -2.5678 | 0.0103 |
| age_squared | -0.0005 | 0.0007 | -0.7296 | 0.4657 |
The effect of frsthalf is highly significant, with \(\text{p-value}\approx 0\). Thus, it is a reasonable instrumental variable.
(iii)
We conduct the regression analysis in R using 2sls.
model_2sls <- feols(children ~ age + age_squared | educ ~ frsthalf, data = data)
tidy_model_2sls <- tidy(model_2sls)
etable_model_2sls_model <- etable(model, model_2sls)
kbl(etable_model_2sls_model, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

The 2SLS model compared with the model from (i) is as follows.
| model | model_2sls | |
|---|---|---|
| Dependent Var.: | children | children |
| Constant | -4.138*** (0.2406) | -3.388*** (0.5482) |
| educ | -0.0906*** (0.0059) | -0.1715** (0.0532) |
| age | 0.3324*** (0.0165) | 0.3236*** (0.0179) |
| age_squared | -0.0026*** (0.0003) | -0.0027*** (0.0003) |
| _______________ | ___________________ | ___________________ |
| S.E. type | IID | IID |
| Observations | 4,361 | 4,361 |
| R2 | 0.56872 | 0.55023 |
| Adj. R2 | 0.56843 | 0.54992 |
We observe that the coefficient on educ becomes greater in magnitude when the analysis takes the instrumental variable into account.
In the OLS model, the estimated coefficient on \(educ\) is \(-0.0906\), which is statistically significant at the 1% level. This means that each additional year of education is associated with approximately \(0.09\) fewer children, holding \(age\) and \(age\_squared\) constant.
In the 2SLS model, where \(frsthalf\) is used as an instrument for \(educ\), the estimated coefficient on \(educ\) is \(-0.1715\), statistically significant at the 1% level (\(t \approx -3.2\)). This suggests that each additional year of education reduces the number of children by about \(0.17\).
The 2SLS estimate is larger in magnitude than the OLS estimate, indicating that OLS likely underestimates the true negative effect of education on fertility, possibly due to omitted variable bias. The \(R^2\) also drops slightly when using 2SLS, which is expected.
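As a side note on mechanics, 2SLS can be reproduced "by hand" with two OLS regressions. The sketch below uses simulated data with assumed names and values (z playing the role of frsthalf, ed of educ); the point estimates coincide, though the second-stage lm() standard errors would not be valid.

```r
library(fixest)

set.seed(4)
n  <- 10000
z  <- rbinom(n, 1, 0.5)           # binary instrument (like frsthalf)
a  <- rnorm(n)                    # unobserved confounder
ed <- 10 - z + a + rnorm(n)       # endogenous regressor (like educ)
y  <- 3 - 0.15 * ed + a + rnorm(n)
d  <- data.frame(y, ed, z)

# Stage 1: regress the endogenous variable on the instrument
d$ed_hat <- fitted(lm(ed ~ z, data = d))
# Stage 2: replace the endogenous variable with its fitted values
by_hand  <- coef(lm(y ~ ed_hat, data = d))[["ed_hat"]]
packaged <- coef(feols(y ~ 1 | ed ~ z, data = d))[["fit_ed"]]

c(by_hand = by_hand, packaged = packaged)  # identical point estimates
```

This makes the name "two-stage least squares" literal: the packaged IV routine is doing exactly these two regressions, but with standard errors corrected for the generated regressor in stage 2.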
(iv)
We conduct the analysis in R.
modelwithols <- feols(children ~ educ + age + age_squared + electric + tv + bicycle, data = data)
NOTE: 5 observations removed because of NA values (RHS: 5).
modelwith2sls <- feols(children ~ age + age_squared + electric + tv + bicycle | educ ~ frsthalf, data = data)
NOTE: 5 observations removed because of NA values (RHS: 5).
etable_models <- etable(modelwithols, modelwith2sls)
kbl(etable_models, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

The analysis yields the following results.
| modelwithols | modelwith2sls | |
|---|---|---|
| Dependent Var.: | children | children |
| Constant | -4.390*** (0.2403) | -3.591*** (0.6451) |
| educ | -0.0767*** (0.0064) | -0.1640* (0.0655) |
| age | 0.3402*** (0.0164) | 0.3281*** (0.0191) |
| age_squared | -0.0027*** (0.0003) | -0.0027*** (0.0003) |
| electric | -0.3027*** (0.0762) | -0.1065 (0.1660) |
| tv | -0.2531** (0.0914) | -0.0026 (0.2092) |
| bicycle | 0.3179*** (0.0494) | 0.3321*** (0.0515) |
| _______________ | ___________________ | ___________________ |
| S.E. type | IID | IID |
| Observations | 4,356 | 4,356 |
| R2 | 0.57606 | 0.55766 |
| Adj. R2 | 0.57548 | 0.55705 |
In the OLS model, the estimated coefficient on \(educ\) is \(-0.0767\), statistically significant at the 1% level. In the 2SLS model, the coefficient on \(educ\) is \(-0.1640\), statistically significant at the 5% level (\(t \approx -2.5\)). This again shows that OLS underestimates the negative effect of education on fertility, likely due to omitted variable bias.
The binary variable \(tv\) has a coefficient of \(-0.2531\) in the OLS model, significant at the 1% level, but becomes insignificant in the 2SLS model. In the OLS results, television ownership is associated with having approximately \(0.25\) fewer children. A possible explanation is that owning a television changes household behavior, for example by increasing exposure to family planning information or modern lifestyles, which could lead to lower fertility rates.
The addition of the exogenous variables \(electric\), \(tv\), and \(bicycle\) only slightly changes the estimated effect of education compared to models without these controls. The \(R^2\) remains relatively stable, suggesting that these variables add some explanatory power without dramatically altering the main relationships.
The results for \(educ\) are very similar to those found earlier. We observe that \(tv\) ownership is associated with fewer children in the OLS model, but this effect disappears when using IV estimation. This suggests that the effect of \(tv\) may not be entirely exogenous.
Television ownership was likely not widespread in Botswana in 1988, meaning that mainly wealthier families owned a TV, and these families also tended to have fewer children on average. However, this challenges the assumption that \(tv\) is exogenous.
It makes sense to add the variables \(electric\), \(tv\), and \(bicycle\) because they can capture household characteristics that are related to fertility decisions. These variables are relevant because they proxy for the household’s wealth, access to modern amenities, and lifestyle choices, all of which can influence the number of children a family chooses to have.
For example, having electricity (\(electric\)) might be associated with better access to information, healthcare, and family planning. Owning a television (\(tv\)) can expose families to ideas about smaller family norms and modern lifestyles. Owning a bicycle (\(bicycle\)) could reflect better mobility and access to services, which may also impact fertility behavior.
Thus, all three variables are relevant as they help control for important factors that could otherwise bias the estimated effect of \(educ\) on fertility.
Exercise 6.
2SLS
Wooldridge Exercise 15.C13 (p. 530)
(i)
The analysis is conducted using the programming language R.
data <- read_dta("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/ProblemSets/ProblemSet10/set 10/labsup.dta")
model0 <- feols(hours ~ kids + nonmomi + educ + age + I(age^2) + black + hispan, vcov = "hetero", data = data)

The analysis yields the following results.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -10.4470 | 6.5900 | -1.5853 | 0.1129 |
| kids | -2.3258 | 0.1155 | -20.1322 | 0.0000 |
| nonmomi | -0.0578 | 0.0054 | -10.8053 | 0.0000 |
| educ | 0.5860 | 0.0375 | 15.6309 | 0.0000 |
| age | 2.0488 | 0.4484 | 4.5690 | 0.0000 |
| I(age^2) | -0.0277 | 0.0077 | -3.6018 | 0.0003 |
| black | 1.0583 | 1.3542 | 0.7815 | 0.4345 |
| hispan | -5.1141 | 1.3549 | -3.7747 | 0.0002 |
The coefficient on \(kids\) is estimated to be \(-2.3258\), meaning that, holding all other variables constant, having one additional child is associated with a decrease of approximately \(2.33\) hours worked. This effect is highly statistically significant, with a t-statistic of \(-20.1322\) and a p-value of \(0.0000\), suggesting strong evidence against the null hypothesis of no effect.
However, we have good reason to believe that \(kids\) may be endogenous. The number of children could be correlated with unobserved factors in the error term that also influence labor supply decisions. For example, personal preferences for family life versus career, health conditions, or access to childcare might affect both how many children a person has and how much they work. If such unobserved factors are present, the OLS estimate would be biased and inconsistent.
(ii)
The variable \(samesex\) appears to be a relevant instrument because the biological sex of the first two children likely influences a family’s decision to have additional children. Specifically, many parents have a preference for a mixed-gender set of children (one boy and one girl). If the first two children are of the same sex, parents may be more inclined to try for a third child, whereas if they already have one of each, they may feel that their family is complete. This creates a plausible positive correlation between \(samesex\) and the total number of children (\(kids\)).
However, although \(samesex\) may satisfy the relevance condition, its exogeneity is less certain. While the gender of children is biologically random and should not directly affect parental labor supply decisions, there could be indirect channels that violate the exclusion restriction. For example, raising two boys versus two girls might systematically affect time demands, stress levels, or child-rearing costs in ways that influence hours worked, independent of the number of children. If, for instance, boys are more likely to have behavioral issues that require parental attention, \(samesex\) could have a direct effect on labor supply beyond its effect through fertility decisions.
In summary, the argument for \(samesex\) as a relevant instrument is strong based on typical parental preferences for mixed-sex offspring. However, careful consideration must be given to the exclusion restriction, as gender composition could affect labor market behavior through mechanisms other than family size.
(iii)
We conduct the analysis in R.
model1 <- feols(kids ~ samesex + nonmomi + educ + age + I(age^2) + black + hispan, vcov = "hetero", data = data)
model1_summary <- tidy(summary(model1, vcov = vcovHC(model1, type = "HC2"), fitstat = "all"))
kbl(model1_summary, format = "html", digits = 4, booktabs = TRUE) %>%
  kable_styling(full_width = FALSE)

The analysis yields the following results.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.0103 | 0.2931 | 6.8591 | 0.0000 |
| samesex | 0.0704 | 0.0103 | 6.8469 | 0.0000 |
| nonmomi | -0.0028 | 0.0003 | -10.8448 | 0.0000 |
| educ | -0.0854 | 0.0020 | -42.0590 | 0.0000 |
| age | 0.0589 | 0.0203 | 2.8989 | 0.0037 |
| I(age^2) | 0.0000 | 0.0004 | 0.0056 | 0.9956 |
| black | 0.0129 | 0.0646 | 0.1992 | 0.8421 |
| hispan | -0.0425 | 0.0647 | -0.6569 | 0.5113 |
The estimated coefficient on \(samesex\) is positive, as expected, and highly statistically significant. The point estimate is \(0.0704\), with a p-value of \(0.0000\), meaning it is significant at any conventional level (1%, 5%, and 10%). This result supports the hypothesis that parents whose first two children are of the same biological sex are more likely to have additional children.
The positive and significant coefficient suggests that, on average, having two children of the same sex increases the total number of children by about \(0.07\). Although the magnitude may seem small, the key point is that \(samesex\) is strongly correlated with \(kids\), fulfilling the relevance condition necessary for a valid instrument.
Thus, based on this regression, the first-stage relationship required for instrumental variable estimation appears to hold: \(samesex\) is a strong predictor of fertility behavior, consistent with the argument that many parents prefer to have at least one boy and one girl. When we say that \(samesex\) is a strong predictor of fertility behavior, we mean that the sex composition of the first two children significantly influences the decision to have more children.
The regression shows that parents whose first two children are of the same sex are more likely to have additional children. This supports the theory that many parents desire to have at least one boy and one girl. Therefore, if the first two children are both boys or both girls, parents are more motivated to continue childbearing to achieve a mixed-gender family.
The strong positive and statistically significant coefficient on \(samesex\) confirms that the biological sex mix of the first two children is an important factor in shaping fertility decisions.
While the positive and statistically significant coefficient on \(samesex\) suggests that it predicts fertility behavior, several critical points should be considered:
First, although \(samesex\) appears to be a strong predictor, the estimated effect size is relatively small (around 0.07 children). While statistically significant due to the large sample size, the economic significance may be limited. A change of 0.07 children is minor in practice, and it may not reflect a major behavioral shift among parents.
Second, the assumption that parents universally prefer a mixed-gender set of children may not hold across different cultures, socioeconomic groups, or individual family preferences. The strength of the preference for a boy and a girl could vary substantially, weakening the interpretation that \(samesex\) consistently influences fertility.
Third, although the instrument appears relevant, this result does not guarantee that the exclusion restriction is satisfied. It is possible that having two children of the same sex affects family decisions beyond fertility, such as parental labor supply, time allocation, investments in children, or aspirations for future earnings. If such direct effects exist, \(samesex\) would not be a valid instrument because it would violate the assumption that the instrument affects the outcome (e.g., hours worked) only through the endogenous variable (number of children).
Fourth, even if \(samesex\) is random at conception, sample selection issues could arise. For example, families with strong gender preferences might engage in selective stopping or selective fertility behaviors, complicating the randomness assumption.
In summary, while the statistical evidence supports \(samesex\) as a relevant instrument, concerns about the size of the effect, cultural heterogeneity, potential violations of the exclusion restriction, and sample selection bias must be taken seriously before concluding that \(samesex\) is a fully valid instrument for fertility decisions.
(iv)
One possible mechanism linking \(samesex\) to the error term \(u\) relates to differences in child development and behavioral issues across genders. If, for example, girls are more likely to face certain emotional or social challenges during childhood, then families with two girls might experience greater demands on parental time and resources, independently of the number of children. This could influence labor supply decisions directly, violating the exclusion restriction by creating a direct pathway from \(samesex\) to hours worked that does not operate solely through fertility.
Another potential mechanism involves the financial costs of raising children. Families with two children of the same sex may be able to economize by reusing clothing, toys, school supplies, and other goods, leading to lower child-rearing expenses compared to families with children of different sexes. As a result, parents with two same-sex children may experience reduced financial pressure, which could allow them to work fewer hours or choose jobs with fewer hours and better flexibility. Again, this would introduce a direct effect of \(samesex\) on labor supply, unrelated to fertility, and would compromise the exogeneity of the instrument.
In both cases, \(samesex\) could be correlated with unobserved factors that affect hours worked, implying that while biological sex is random, the downstream effects on family life are not necessarily neutral with respect to labor market outcomes.
(v)
It is not legitimate to test for the exogeneity of \(samesex\) simply by adding it directly to the structural equation and checking its significance. Doing so does not provide a valid test because the structural equation is already potentially affected by omitted variable bias. If important unobserved factors are missing from the model, the coefficient on \(samesex\) could appear statistically significant even if \(samesex\) is truly exogenous, or it could appear insignificant even if \(samesex\) is endogenous.
In other words, the presence of omitted variables distorts the estimated relationship between \(samesex\) and the outcome variable, making it impossible to draw reliable conclusions about exogeneity based solely on its significance. A proper test for exogeneity would require an approach like an overidentification test (e.g., Hansen’s J test) in a setting with multiple instruments, rather than simply adding the instrument to the outcome regression.
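To make the overidentification idea concrete, here is a minimal simulated sketch of a Sargan-type test with two instruments for one endogenous regressor. All variable names and numbers are invented for illustration and are not part of the problem-set data:

```r
# Illustrative Sargan test on simulated data (names and numbers are made up)
set.seed(1)
n  <- 5000
u  <- rnorm(n)                            # structural error
z1 <- rnorm(n); z2 <- rnorm(n)            # two instruments, both valid here
x  <- 0.5*z1 + 0.5*z2 + 0.7*u + rnorm(n)  # endogenous regressor
y  <- 1 + 2*x + u                         # structural equation, true beta = 2

xhat <- fitted(lm(x ~ z1 + z2))           # first stage
b    <- coef(lm(y ~ xhat))                # 2SLS point estimates

res  <- y - b[1] - b[2]*x                 # residuals use the ORIGINAL x
J    <- n * summary(lm(res ~ z1 + z2))$r.squared  # Sargan statistic: n * R^2
pval <- pchisq(J, df = 1, lower.tail = FALSE)     # df = #instruments - #endogenous
pval
```

Because both instruments are valid by construction, the test should typically fail to reject; with only one instrument (as with \(samesex\) alone), the model is just identified and no such test is available.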
(vi)
We conduct the analysis in R.
Display Code
model2 <- feols(hours ~ nonmomi + educ + age + I(age^2) + black + hispan | kids ~ samesex, vcov="hetero", data=data)
sum_model2 <- summary(model2, vcov=vcovHC(model2, type="HC2"), fitstat="all")
etable_model0_model2 <- etable(model0, sum_model2)
kbl(etable_model0_model2, format="html", digits=4, booktabs=TRUE) %>% kable_styling(full_width = FALSE)
The analysis becomes as follows.
| model0 | sum_model2 | |
|---|---|---|
| Dependent Var.: | hours | hours |
| Constant | -10.45 (6.589) | -5.254 (NaN) |
| kids | -2.326*** (0.1155) | -4.879 (NaN) |
| nonmomi | -0.0578*** (0.0054) | -0.0649 (NaN) |
| educ | 0.5860*** (0.0375) | 0.3680 (NaN) |
| age | 2.049*** (0.4484) | 2.201 (NaN) |
| age square | -0.0277*** (0.0077) | -0.0277 (NaN) |
| black | 1.058 (1.351) | 1.095 (NaN) |
| hispan | -5.114*** (1.352) | -5.218 (NaN) |
| _______________ | ___________________ | _____________ |
| S.E. type | Heteroskedast.-rob. | Custom |
| Observations | 31,857 | 31,857 |
| R2 | 0.07270 | 0.05826 |
| Adj. R2 | 0.07250 | 0.05805 |
When we use \(samesex\) as an instrument for \(kids\), we find that the IV estimate of the effect of \(kids\) on \(hours\) is approximately \(-4.879\), compared to the OLS estimate of \(-2.326\). This suggests that the IV estimate is larger in magnitude than the OLS estimate, indicating that OLS may underestimate the true negative effect of additional children on hours worked.
However, the IV estimates are not very precise. The standard errors are not reported (NaNs appear) because the computation of the HC2 robust variance became numerically unstable. This instability often occurs when there are influential observations or nearly perfect predictions in small groups created by the instrument.
In IV regressions, numerical instability often arises when the first stage is relatively weak or when there is very little variation in the instrument relative to the endogenous regressor. Specifically, if a small subgroup of the sample strongly determines the fitted values of \(kids\) based on \(samesex\), the model’s leverage values (hat values) become very high — close to 1. HC2 corrections are based on adjusting for leverage, and when leverage is extremely high, the denominator in the HC2 formula approaches zero, leading to division by very small numbers or undefined results.
Another way to see this is that the instrument \(samesex\) partitions the sample into groups (same-sex versus mixed-sex families). If these groups are small, unbalanced, or if one group almost perfectly predicts \(kids\), it can cause the model to “overfit” within the group, making standard error calculations highly sensitive and unstable.
Therefore, while the coefficient estimates can still be computed, the standard errors — which measure how much uncertainty there is around the estimates — become unreliable or cannot be calculated at all with HC2 adjustments. This lack of precision makes it difficult to assess the true statistical significance of the IV estimates without fixing the numerical instability.
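The leverage mechanism described above can be seen directly in a small simulation. This is an illustrative sketch with invented data, not the problem-set sample:

```r
# Why HC2 can explode: one extreme-leverage point (illustrative simulation)
set.seed(2)
x <- c(rnorm(50), 30)          # one observation far from the rest
y <- 1 + 2*x + rnorm(51)
m <- lm(y ~ x)
h <- hatvalues(m)              # leverage (hat) values h_i
max(h)                        # close to 1 for the outlying point

# HC2 scales each squared residual by 1/(1 - h_i):
contrib <- resid(m)^2 / (1 - h)
# As h_i -> 1 the divisor 1 - h_i -> 0, so that observation's contribution
# blows up; if h_i is numerically equal to 1, the result is Inf or NaN.
```

In the IV regression above, the fitted values of \(kids\) depend on a binary instrument, which can concentrate leverage in exactly this way.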
The fact that the IV coefficient is more negative than the OLS coefficient is consistent with a downward bias in OLS caused by endogeneity — for example, if parents who are more career-oriented have fewer children, biasing OLS toward zero.
Nevertheless, because of the large uncertainty and instability in the IV results, we must be cautious. The lack of precise standard errors means that we cannot be fully confident in the statistical significance or reliability of the IV estimate without addressing the instability or trying alternative robust methods.
Solution
Instead of using the HC2 correction for standard errors we go one step back and use the HC1 correction instead.
Thus the analysis must be conducted as follows.
Display Code
model2 <- feols(hours ~ nonmomi + educ + age + I(age^2) + black + hispan | kids ~ samesex, vcov="hetero", data=data)
sum_model2 <- summary(model2, vcov=vcovHC(model2, type="HC1"), fitstat="all")
etable_model0_model2 <- etable(model0, sum_model2)
kbl(etable_model0_model2, format="html", digits=4, booktabs=TRUE) %>% kable_styling(full_width = FALSE)
Which yields the following result.
| model0 | sum_model2 | |
|---|---|---|
| Dependent Var.: | hours | hours |
| Constant | -10.45 (6.589) | -5.254 (166.7) |
| kids | -2.326*** (0.1155) | -4.879 (81.92) |
| nonmomi | -0.0578*** (0.0054) | -0.0649 (0.2274) |
| educ | 0.5860*** (0.0375) | 0.3680 (6.995) |
| age | 2.049*** (0.4484) | 2.201 (4.907) |
| age square | -0.0277*** (0.0077) | -0.0277*** (0.0079) |
| black | 1.058 (1.351) | 1.095 (1.824) |
| hispan | -5.114*** (1.352) | -5.218 (3.603) |
| _______________ | ___________________ | ___________________ |
| S.E. type | Heteroskedast.-rob. | Custom |
| Observations | 31,857 | 31,857 |
| R2 | 0.07270 | 0.05826 |
| Adj. R2 | 0.07250 | 0.05805 |
Thus the interpretation becomes as follows.
The new results using HC1 standard errors instead of HC2 show the same point estimates for the coefficients but now provide meaningful (finite) standard errors, making the IV results interpretable.
Interpretation of the main results:
The coefficient on \(kids\) in the OLS model is \(-2.326\), and in the IV model it is \(-4.879\).
This indicates that each additional child reduces hours worked by about 2.33 hours according to OLS, but about 4.88 hours according to IV.
The IV estimate is larger in absolute value, suggesting that OLS is likely biased toward zero by endogeneity (e.g., unobserved preferences affecting both fertility and labor supply).
The standard errors under HC1 are now reported and meaningful. For example, the standard error for \(kids\) in the IV model is \(81.92\), which is extremely large compared to the size of the coefficient (\(-4.879\)), implying that the IV estimate is very imprecise and not statistically significant.
Other variables like \(nonmomi\), \(educ\), \(age\), and \(hispan\) also show effects, some of which remain statistically significant.
Why the results differ from HC2:
HC2 corrects more aggressively for leverage by scaling residuals according to their individual hat values (leverage scores).
When leverage is very high (near 1), HC2 becomes unstable and can produce NaNs or extremely large variances.
HC1, in contrast, simply rescales the HC0 estimate by the degrees-of-freedom factor \(n/(n-k)\) and does not account for leverage, making it more stable but slightly less robust in the presence of influential observations.
Which measure is better and when:
HC2 is theoretically better if you are concerned about heteroskedasticity and high leverage points because it adjusts for leverage individually.
HC1 is better when you have high leverage problems causing numerical instability (as here) because it remains computationally stable.
In small samples or when leverage is extreme, HC2 can be unreliable. In larger, balanced samples, HC2 would usually be preferable.
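The three corrections can be compared directly with the sandwich package (already loaded in the setup). This is a hedged sketch on simulated heteroskedastic data, not the problem-set sample:

```r
# Comparing HC0/HC1/HC2 standard errors on simulated heteroskedastic data
library(sandwich)
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + x + rnorm(n) * (1 + abs(x))   # error variance grows with |x|
m <- lm(y ~ x)

se_x <- function(type) sqrt(diag(vcovHC(m, type = type)))[["x"]]
c(HC0 = se_x("HC0"),   # raw squared residuals
  HC1 = se_x("HC1"),   # HC0 scaled by n/(n - k)
  HC2 = se_x("HC2"))   # squared residuals scaled by 1/(1 - h_i)
```

In well-behaved samples the three values are close; they diverge exactly when leverage is extreme, which is the situation encountered above.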
Interpretation of \(R^2\) and adjusted \(R^2\):
- \(R^2\) measures the proportion of variation in \(hours\) explained by the regressors.
- In OLS, \(R^2 = 0.0727\), meaning about 7.27% of the variation in working hours is explained by \(kids\), \(nonmomi\), \(educ\), \(age\), \(age^2\), \(black\), and \(hispan\).
- In the IV model, \(R^2 = 0.0583\), lower than in OLS, which is typical because IV estimation often sacrifices fit for bias correction.
- Adjusted \(R^2\) corrects \(R^2\) for the number of regressors to penalize overfitting. In OLS, adjusted \(R^2 = 0.0725\); in the IV model, adjusted \(R^2 = 0.0581\).
- Lower \(R^2\) and adjusted \(R^2\) in the IV regression are normal because instrumental variables estimation focuses on identification of causal effects, not on maximizing fit.
Summary:
The IV point estimate of the effect of children on labor supply is much larger in magnitude than the OLS estimate, indicating likely bias in OLS.
However, the IV estimate is very imprecise due to the weak instrument \(samesex\).
HC1 standard errors are more stable in this case than HC2, even if slightly less theoretically robust.
The lower \(R^2\) in IV models reflects the usual trade-off in causal estimation: less explained variance but better control of bias.
Conclusion:
The IV coefficient is considerably larger in magnitude than the OLS coefficient, but it is not statistically significant.
In IV estimation, the standard error is usually larger than in OLS because only the variation in \(x\) that is explained by the instrument \(z\) is used to estimate the effect on \(y\). Since we are focusing on a smaller, instrument-induced variation in \(x\), there is less independent information available to explain the outcome \(y\), leading to greater imprecision in the coefficient estimates. In other words, by isolating the “clean” variation, we reduce the effective sample information, which increases the variance of the estimator.
This principle will be treated more formally in Econometrics 1, where the formula for the asymptotic variance of the IV estimator is:
\[ \text{Var}(\hat{\beta}_{IV}) = \sigma^2 \left( (Z'X)^{-1} (Z'Z) (X'Z)^{-1} \right) \]
where:
\(Z\) is the matrix of instruments,
\(X\) is the matrix of endogenous regressors,
\(\sigma^2\) is the variance of the error term.
The key takeaway is that if the instruments weakly explain \(X\), the terms \((Z'X)\) and \((X'Z)\) will be small, making the variance of \(\hat{\beta}_{IV}\) large, which leads to high standard errors and low precision.
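This takeaway can be checked with a small Monte Carlo experiment: as the first-stage coefficient shrinks, the sampling dispersion of the IV estimator grows sharply. All numbers here are invented for illustration:

```r
# Weak first stage -> imprecise IV: Monte Carlo sketch (illustrative numbers)
set.seed(4)
n <- 2000
iv_once <- function(pi1) {               # pi1 = first-stage coefficient
  u <- rnorm(n); z <- rnorm(n)
  x <- pi1*z + 0.8*u + rnorm(n)          # endogenous regressor
  y <- 1 + 2*x + u
  cov(z, y) / cov(z, x)                  # simple IV (Wald) estimator
}
sd_strong <- sd(replicate(500, iv_once(1.0)))
sd_weak   <- sd(replicate(500, iv_once(0.1)))
c(strong = sd_strong, weak = sd_weak)    # the weak case is far more dispersed
```

This mirrors the \(samesex\) application: a first-stage coefficient of only 0.07 leaves little exogenous variation with which to estimate the effect of \(kids\), hence the huge IV standard errors.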
(vii)
The analysis is conducted in R.
Display Code
model3 <- feols(kids ~ samesex + multi2nd + nonmomi + educ + age + I(age^2) + black + hispan, vcov="hetero", data=data)
sum_model3 <- summary(model3, vcov=vcovHC(model3, type="HC2"), fitstat="all")
etable_model0_model3 <- etable(model0, sum_model3)
kbl(etable_model0_model3, format="html", digits=4, booktabs=TRUE) %>% kable_styling(full_width = FALSE)
wald(model3, "samesex=0, multi2nd=0", vcov=vcovHC(model3, type="HC2"))
[1] NA
The analysis yields the following results.
| model0 | sum_model3 | |
|---|---|---|
| Dependent Var.: | hours | kids |
| Constant | -10.45 (6.589) | 2.043*** (0.2925) |
| kids | -2.326*** (0.1155) | |
| nonmomi | -0.0578*** (0.0054) | -0.0028*** (0.0003) |
| educ | 0.5860*** (0.0375) | -0.0853*** (0.0020) |
| age | 2.049*** (0.4484) | 0.0563** (0.0203) |
| age square | -0.0277*** (0.0077) | 4.36e-5 (0.0004) |
| black | 1.058 (1.351) | 0.0106 (0.0647) |
| hispan | -5.114*** (1.352) | -0.0420 (0.0648) |
| samesex | 0.0704*** (0.0102) | |
| multi2nd | 0.7632*** (0.0548) | |
| _______________ | ___________________ | ___________________ |
| S.E. type | Heteroskedast.-rob. | Custom |
| Observations | 31,857 | 31,857 |
| R2 | 0.07270 | 0.12437 |
| Adj. R2 | 0.07250 | 0.12415 |
Exercise 7.
2SLS standard errors
Wooldridge Exercise 15.C9 (p. 528)
Display Task
Display Solution
(i)
We estimate the equation in R.
The equation is given by:
\[
\log(\text{wage})=\beta_0+\beta_1\text{educ}+\beta_2\text{exper}+\beta_3\text{tenure}+\beta_4\text{black}+u
\]
2SLS (Two-Stage Least Squares)
2SLS is a method used in econometrics to estimate causal effects when one or more explanatory variables are endogenous.
\(\text{Endogenous} = \text{Correlated with the error term} \Longrightarrow \text{causes bias in OLS.}\) \(\text{Exogenous} = \text{Uncorrelated with the error term} \Longrightarrow \text{safe to use in OLS.}\)
Why use 2SLS?
When an explanatory variable (like education) is influenced by other factors that also affect the outcome (like wages), OLS gives biased results.
For example, ability affects both education and wages, but if ability is unobserved, it ends up in the error term which implies that education becomes endogenous.
2SLS solves this by using an “instrument”:
- An instrument (Z) affects the endogenous variable (X) but is not correlated with the error term (u).
Example: distance to college (nearc4) affects education (X), but not directly wages (Y).
How 2SLS works:
Stage 1:
Regress the endogenous variable (X) on the instrument(s) (Z) and any exogenous controls.
Get the predicted values: these are “clean” versions of X, free from the problematic correlation with u.
Stage 2:
Regress the outcome variable (Y) on the predicted values from stage 1 (and any exogenous controls).
This gives a consistent estimate of the causal effect.
Summary:
- OLS fails with endogeneity.
- 2SLS replaces the problematic variable with a version that is not correlated with the error.
- As long as the instrument is valid (relevant and exogenous), 2SLS gives consistent estimates.
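The logic above can be verified in a few lines on simulated data. This is an illustrative sketch with invented names and coefficients, not the wage2 data used below:

```r
# OLS vs 2SLS under endogeneity: simulated sketch (names are illustrative)
set.seed(5)
n <- 10000
ability <- rnorm(n)                        # unobserved; ends up in the error
z <- rnorm(n)                              # instrument: shifts x, unrelated to ability
x <- 1 + 0.8*z - 0.5*ability + rnorm(n)    # "education": endogenous
y <- 2 + 0.5*x + ability + rnorm(n)        # "wage" equation; true effect = 0.5

b_ols <- coef(lm(y ~ x))[["x"]]            # biased (here: pushed below 0.5)
xhat  <- fitted(lm(x ~ z))                 # stage 1
b_iv  <- coef(lm(y ~ xhat))[["xhat"]]      # stage 2: close to 0.5
c(OLS = b_ols, IV = b_iv)
```

OLS is pulled away from the true coefficient by the omitted \(ability\) term, while the two-stage estimate recovers it.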
In R:
Display Code
data <- read_dta("C:/Users/laust/Documents/Fag/4. Sem/Econometrics/data/wage2.dta")
The following line estimates a 2SLS (Two-Stage Least Squares) regression using the fixest package in R.
model <- feols(lwage ~ exper + tenure + black | 0 | educ ~ sibs, data=data)
We are estimating the model:
\[ \log(wage) = \beta_0 + \beta_1 \cdot educ + \beta_2 \cdot exper + \beta_3 \cdot tenure + \beta_4 \cdot black + u \] Because \(educ\) is endogenous, we use \(sibs\) as an instrument.
Explanation of the parts:
lwage ~ exper + tenure + black:
Here lwage is the outcome; exper, tenure, and black are the exogenous regressors, assumed to be uncorrelated with the error term \(u\).
| 0 |:
This means no fixed effects are included. Fixed effects could be included here if needed.
educ ~ sibs:
This specifies the instrumental variable. It means \(educ\) is endogenous and should be instrumented using \(sibs\).
How 2SLS works:
Stage 1:
Regress \(educ\) on \(sibs\) and the exogenous variables (\(exper\), \(tenure\), \(black\)).
This gives the predicted values of \(educ\) that are uncorrelated with the error term \(u\).
Stage 2:
Regress \(lwage\) on the predicted values of \(educ\) (from stage 1), along with \(exper\), \(tenure\), and \(black\).
This procedure removes the endogeneity problem, giving a consistent estimate of the causal effect of education on wages.
The instrument \(sibs\) is valid if:
It is correlated with \(educ\) (relevance).
It is uncorrelated with \(u\) (exogeneity).
If these hold, the 2SLS estimate of \(\beta_1\) is consistent.
Model summary below:
tidy_model <- tidy(model)
kbl(tidy_model, format="html", digits=4, booktabs = TRUE) %>% kable_styling(full_width=FALSE)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.2160 | 0.5435 | 9.5979 | 0.0000 |
| fit_educ | 0.0936 | 0.0337 | 2.7768 | 0.0056 |
| exper | 0.0209 | 0.0084 | 2.4943 | 0.0128 |
| tenure | 0.0115 | 0.0027 | 4.2152 | 0.0000 |
| black | -0.1833 | 0.0501 | -3.6566 | 0.0003 |
We observe that all coefficients in the model are statistically significant at conventional levels (p < 0.05), and we interpret them as follows:
- The estimated return to education is 0.0936, meaning one additional year of education increases log(wage) by approximately 9.36%, holding other factors constant. This is interpreted as a causal effect, since education was instrumented using sibs.
- The coefficient on exper is 0.0209, suggesting that each additional year of work experience increases log(wage) by about 2.09%.
- The coefficient on tenure is 0.0115, meaning that staying longer at the same job also increases wages, by 1.15% per year.
- The variable black has a negative coefficient of -0.1833, indicating that, all else equal, Black individuals earn about 18.33% less than others, on average.
- The constant term (intercept) is 5.2160, which represents the expected log(wage) when all regressors are zero; it is not meaningful by itself but is needed for the model.
Final words:
All variables are significant, and the estimated return to education is positive, large, and statistically significant, even after correcting for endogeneity using 2SLS.
sibs (number of siblings) is a reasonable instrument for educ (years of education) if it satisfies two key conditions:
Relevance
The instrument must be correlated with the endogenous variable (\(educ\)).
In this case, it is likely that individuals with more siblings receive less education, due to limited family resources (like time, money, or attention).
So, we expect:
\[ Cov(sibs, educ) \neq 0 \]
This is called the instrument relevance condition.
Exogeneity
The instrument must be uncorrelated with the error term in the wage equation.
This means sibs must affect wages only through its effect on education, not directly or through unobserved factors (like ability).
So, we want:
\[ Cov(sibs, u) = 0 \]
This is the exclusion restriction or instrument exogeneity.
Why sibs makes sense:
It’s usually determined early in life, before schooling and wages are set.
It reflects family background, which can affect education but not necessarily wages directly.
It is plausibly exogenous: unless parents choose family size based on wage expectations for a specific child (unlikely), sibs should not be related to the wage error term.
Caveat:
The validity of sibs as an instrument depends on the assumption that it does not influence wages through any other channel than education.
If, for example, family size also affects personality, motivation, or network access (which influence wages), then this assumption may be violated.
Conclusion:
sibs is a reasonable instrument for educ if we believe it affects wages only by influencing how much education someone gets, and not directly.
(ii)
In this task we are asked to manually perform a 2SLS (Two-Stage Least Squares) regression, step by step, instead of using a built-in function like feols().
Step 1: First-stage regression
Regress the endogenous variable educ on the instrument: sibs and the exogenous controls: exper, tenure, black.
This yields the fitted values (predicted values) of education, denoted: \[ \hat{educ}_i \]
These predicted values are the “clean” version of educ, not affected by endogeneity.
Step 2: Second-stage regression
We take the predicted values from step 1, and run this regression:
\[ \log(wage_i) = \beta_0 + \beta_1 \hat{educ}_i + \beta_2 exper_i + \beta_3 tenure_i + \beta_4 black_i + u_i \]
This yields the estimated coefficients \(\hat{\beta}_1, \hat{\beta}_2, \dots\).
What to check:
- The coefficients from this manual 2SLS should match the ones from the automatic feols(... | ... | educ ~ sibs) regression.
- The standard errors, however, will usually be different, and not valid, because the second-stage OLS does not account for the fact that \(\hat{educ}_i\) is a generated regressor.
Why is this useful?
- It demonstrates how 2SLS really works behind the scenes.
- It confirms that the logic of instrumental variables leads to the same result.
- It is a reminder that only specialized functions like feols() give correct standard errors for IV/2SLS models.
In R:
Display Code
In our calculations we use the packages haven, dplyr and broom.
First stage:
# Here we regress educ on `sibs + controls` as follows
stage_1 <- lm(educ ~ sibs + exper + tenure + black, data = data)
This runs the first-stage regression where education is predicted using the instrument sibs and the exogenous control variables exper, tenure, black.
This corresponds to: \[ educ_i = \pi_0 + \pi_1 \cdot sibs_i + \pi_2 \cdot exper_i + \pi_3 \cdot tenure_i + \pi_4 \cdot black_i + v_i \]
# Storing the predicted (fitted) values from the first stage as follows
data <- data %>%
mutate(educ_hat = fitted(stage_1))
This adds a new column educ_hat to the data, which contains the predicted values of educ based on the first-stage regression. These values are free from endogeneity.
Second stage:
# Regressing 'lwage' on predicted 'educ' and controls as follows
stage_2 <- lm(lwage ~ educ_hat + exper + tenure + black, data = data)
This runs the second-stage regression, using the predicted education variable (educ_hat) instead of the original (endogenous) educ.
This corresponds to: \[ \log(wage_i) = \beta_0 + \beta_1 \cdot \hat{educ}_i + \beta_2 \cdot exper_i + \beta_3 \cdot tenure_i + \beta_4 \cdot black_i + u_i \]
Displaying second stage as follows:
tidy_stage_2 <- tidy(stage_2)
kbl(tidy_stage_2, format="html", digits=4, booktabs=TRUE) %>% kable_styling(full_width = FALSE)
This regression yields the following.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.2160 | 0.5688 | 9.1699 | 0.0000 |
| educ_hat | 0.0936 | 0.0353 | 2.6530 | 0.0081 |
| exper | 0.0209 | 0.0088 | 2.3831 | 0.0174 |
| tenure | 0.0115 | 0.0029 | 4.0272 | 0.0001 |
| black | -0.1833 | 0.0525 | -3.4936 | 0.0005 |
The previous regression yielded the following.
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.2160 | 0.5435 | 9.5979 | 0.0000 |
| fit_educ | 0.0936 | 0.0337 | 2.7768 | 0.0056 |
| exper | 0.0209 | 0.0084 | 2.4943 | 0.0128 |
| tenure | 0.0115 | 0.0027 | 4.2152 | 0.0000 |
| black | -0.1833 | 0.0501 | -3.6566 | 0.0003 |
Comparison of Manual 2SLS vs Proper 2SLS
| Term | Coef (manual) | SE (manual) | Coef (feols) | SE (feols) | Difference |
|---|---|---|---|---|---|
| Intercept | 5.2160 | 0.5688 | 5.2160 | 0.5435 | 0.0000 |
| Education | 0.0936 | 0.0353 | 0.0936 | 0.0337 | 0.0000 |
| Experience | 0.0209 | 0.0088 | 0.0209 | 0.0084 | 0.0000 |
| Tenure | 0.0115 | 0.0029 | 0.0115 | 0.0027 | 0.0000 |
| Black | -0.1833 | 0.0525 | -0.1833 | 0.0501 | 0.0000 |
Interpretation:
All coefficients are identical across both models, which confirms that the manual 2SLS is implemented correctly. However, the standard errors differ slightly:
- Manual 2SLS SEs are a bit larger, especially for educ.
- The manual second-stage SEs are invalid in general, because OLS treats educ_hat as data rather than as an estimated (generated) regressor and therefore uses the wrong residuals when estimating the error variance.
- The feols() function correctly adjusts for this using the full 2SLS variance formula.
Conclusion:
- Coefficient estimates are the same, which confirms the algebra of 2SLS.
- Standard errors from feols() are correct and preferred.
- One should never report standard errors from manual 2SLS unless they have been adjusted using the proper 2SLS variance formula.
- In practice, it is wiser to use feols() (or similar tools like ivreg()) for valid inference.
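For completeness, the manual standard errors can be repaired in the homoskedastic case by re-estimating the error variance with the original regressor and rescaling the second-stage covariance matrix. This is a hedged sketch on simulated data (all names and numbers invented), not a re-run of the wage2 model:

```r
# Repairing manual 2SLS standard errors (homoskedastic case, simulated data)
set.seed(6)
n <- 5000
u <- rnorm(n); z <- rnorm(n)
x <- 0.8*z + 0.6*u + rnorm(n)              # endogenous regressor
y <- 1 + 0.5*x + u

xhat <- fitted(lm(x ~ z))                  # stage 1
s2   <- lm(y ~ xhat)                       # stage 2 (its naive SEs are invalid)
b    <- coef(s2)

sig2_naive   <- sum(resid(s2)^2) / (n - 2)            # residuals use xhat
sig2_correct <- sum((y - b[1] - b[2]*x)^2) / (n - 2)  # residuals use the ORIGINAL x

V_2sls <- vcov(s2) * sig2_correct / sig2_naive        # rescale the naive vcov
sqrt(diag(V_2sls))                                    # valid 2SLS standard errors
```

The key point is that the 2SLS covariance is \(\hat\sigma^2(\hat{X}'\hat{X})^{-1}\) with \(\hat\sigma^2\) computed from structural residuals \(y - X\hat\beta\), which is exactly what the rescaling restores.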
(iii)
In this task we must intentionally perform an incorrect two-stage regression in order to see what goes wrong when 2SLS is not done properly.
We are tasked with:
Step 1 (incorrect first stage):
Regressing educ only on sibs, not including other exogenous controls (exper, tenure, black).
This gives: \[ \hat{educ}_i = \pi_0 + \pi_1 \cdot sibs_i \]
Step 2:
Taking the predicted values from the incorrect first stage, and plug them into the second-stage regression: \[
\log(wage_i) = \beta_0 + \beta_1 \cdot \hat{educ}_i + \beta_2 \cdot exper_i + \beta_3 \cdot tenure_i + \beta_4 \cdot black_i + u_i
\]
Why is this wrong?:
In proper 2SLS, the first stage must include all exogenous regressors from the second stage — this maintains orthogonality conditions and ensures consistency.
Omitting them breaks the logic of the instrument and means:
- The predicted values of educ are not properly isolated from the error term in the second stage.
- The result is an inconsistent estimate of the effect of education (\(\beta_1\)).
We are not just getting incorrect standard errors — the actual coefficient is biased.
What might be the purpose of the task:
To compare the incorrect estimate of the return to education with the correct 2SLS estimate, and see how much the bias affects the result.
What we want to look for in the output:
- The coefficient on educ (i.e. the return to education) in the incorrect model.
- Compare it with the correct 2SLS estimate.
- Likely, the incorrect estimate will be different and biased, confirming that 2SLS only works when done properly.
In R:
Display Code
#1. An incorrect first stage - here we only regress educ on sibs
incorrect_stage_1 <- lm(educ ~ sibs, data = data)
# saving the predicted values of educ in the dataset
data$educ_tilde <- fitted(incorrect_stage_1)
#2. Regressing lwage on the incorrect predicted values and other controls
incorrect_stage_2 <- lm(lwage ~ educ_tilde + exper + tenure + black, data = data)
# Loading the results into a table
tidy_incorrect_stage_2 <- tidy(incorrect_stage_2)
kbl(tidy_incorrect_stage_2, format="html", digits=4, booktabs=TRUE) %>% kable_styling(full_width=FALSE)
The incorrect model:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.7710 | 0.3604 | 16.0139 | 0.0000 |
| educ_tilde | 0.0700 | 0.0264 | 2.6530 | 0.0081 |
| exper | -0.0004 | 0.0031 | -0.1262 | 0.8996 |
| tenure | 0.0140 | 0.0027 | 5.1932 | 0.0000 |
| black | -0.2416 | 0.0415 | -5.8185 | 0.0000 |
Comparing with the model from (i):
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5.2160 | 0.5435 | 9.5979 | 0.0000 |
| fit_educ | 0.0936 | 0.0337 | 2.7768 | 0.0056 |
| exper | 0.0209 | 0.0084 | 2.4943 | 0.0128 |
| tenure | 0.0115 | 0.0027 | 4.2152 | 0.0000 |
| black | -0.1833 | 0.0501 | -3.6566 | 0.0003 |
We observe that failing to include all of the exogenous explanatory variables (those uncorrelated with the error term) in the first stage introduces bias into the analysis, making the results unreliable.
A model comparison and interpretation follow.
The Incorrect Model (from part iii)
| Variable | Coefficient | p-value | Interpretation |
|---|---|---|---|
| (Intercept) | 5.7710 | 0.0000 | Baseline log(wage) when all variables are 0. |
| educ_tilde | 0.0700 | 0.0081 | A 1-year increase in education (from incorrect first stage) increases log(wage) by ~7%. Statistically significant. |
| exper | -0.0004 | 0.8996 | No effect. Not significant at all. Likely distorted by incorrect first stage. |
| tenure | 0.0140 | 0.0000 | Each year of tenure increases log(wage) by ~1.4%. Highly significant. |
| black | -0.2416 | 0.0000 | Being Black is associated with ~24.2% lower log(wage). Highly significant. |
The Correct Model (from part i)
| Variable | Coefficient | p-value | Interpretation |
|---|---|---|---|
| (Intercept) | 5.2160 | 0.0000 | Baseline log(wage) when all variables are 0. |
| fit_educ | 0.0936 | 0.0056 | A 1-year increase in education (correct 2SLS) increases log(wage) by ~9.4%. Statistically significant. |
| exper | 0.0209 | 0.0128 | Each year of experience increases log(wage) by ~2.1%. Significant. |
| tenure | 0.0115 | 0.0000 | Each year of tenure increases log(wage) by ~1.15%. Highly significant. |
| black | -0.1833 | 0.0003 | Being Black is associated with ~18.3% lower log(wage). Highly significant. |
Summary:
The incorrect model underestimates the return to education.
It produces a nonsensical coefficient for experience (wrong sign, not significant).
This shows that excluding controls from the first stage leads to biased estimates, not just incorrect standard errors.
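The same conclusion can be demonstrated on simulated data, where the true coefficient is known. This sketch uses invented names and numbers; note that the bias appears when the instrument is correlated with the omitted exogenous control:

```r
# Omitting exogenous controls from the first stage: simulated demonstration
set.seed(7)
n <- 20000
u <- rnorm(n)
z <- rnorm(n)                                # instrument
w <- 0.5*z + rnorm(n)                        # exogenous control, correlated with z
x <- 1 + 0.8*z + 0.8*w + 0.6*u + rnorm(n)    # endogenous regressor
y <- 2 + 0.5*x + 1.0*w + u                   # true effect of x is 0.5

xhat_ok  <- fitted(lm(x ~ z + w))            # correct first stage: includes w
b_ok     <- coef(lm(y ~ xhat_ok + w))[["xhat_ok"]]    # consistent, near 0.5

xhat_bad <- fitted(lm(x ~ z))                # incorrect first stage: omits w
b_bad    <- coef(lm(y ~ xhat_bad + w))[["xhat_bad"]]  # inconsistent
c(correct = b_ok, incorrect = b_bad)
```

As in the wage2 exercise, the coefficient itself moves, not just its standard error, which is why the controls must appear in both stages.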